IEEE Reliability Society Newsletter

What does a Reliability Engineer do?

Dr. Samuel Keene, FIEEE

I was recently asked, “What does a Reliability Engineer do?” The short answer I gave was that we help make systems, components, and software more reliable. We do this largely by asking a lot of questions, combined with testing and analysis of components and designs, to measure and improve the reliability of the systems that concern us. Often, it is a path of joint discovery, working with the designer or programmer whose design we are analyzing. My goal is to make it easier on the domain knowledge holder by doing the paperwork myself, thereby optimizing the use of the designer’s time. Then I document the findings of our joint analysis. We have always found product improvement opportunities. If no improvement opportunities were found, that would be a good validation of the design. The practice demonstrates due diligence in either case.

Permit me to expand on that answer by citing some activities and highlights of my career (not in strict chronological order):

Becoming a Reliability Engineer

I did my undergraduate work in Physics at the University of Maryland. I then joined Bendix Radio in Towson, Maryland, where Bendix wanted me to work in Reliability. That discipline has stuck with me throughout my career. I have also managed the broader assurance, or specialty engineering, field, which includes human factors, maintainability, EMC, safety, and power systems development. It has been a rewarding career.

Education and teaching experiences

My favorite high school classes were Plane Geometry and Physics. I also liked Mathematics. Physics particularly appealed to me since it applied mathematics to the real world. I graduated from the University of Maryland in Physics. I had an opportunity to get my MS in Physics from Drexel University through an industry cooperative night program in Baltimore. In my MS program I had one course in Operations Research (OR), which I loved. It applied mathematics to everyday problems. In OR, for example, one could optimize the loading, assignment, and routing of trucks. You could optimize computer architectures for greater processing speed, and so forth. So when I chose my PhD program, I selected OR as my field of study. I completed my PhD at the University of Colorado Business School. I took all the course work for an MBA degree.

I was always ready to take courses. IBM sent me to a one-week training course almost every year. This was in addition to the in-house courses that they offered. I loved learning. Also, I was a corporate trainer throughout my career, for every company that I worked for. I taught soft skill courses in: Creativity, Delegation, Problem Analysis, Decision Making, Transformational Leadership, and technical courses in Six Sigma, Failure Analysis, Requirements Development, and Reliability.

I also took a one-year leave of absence to teach EE at Prairie View A&M in Prairie View, Texas, in 1973-74. There I taught Logic Design and Technical Writing courses, and I was also the mentor for Senior Class Projects. I found the school to be rich in outreach to students. Most students came from poor circumstances. There was good human bonding between students and faculty.

Reliability Predictions

I began my career at Bendix Radio in Towson, MD, making reliability predictions for military electronic systems. These predictions used component stress analysis (based upon electrical stress levels, environmental stress levels, and specific application factors), primarily following MIL-HDBK-217. Many reliability folks disparage this handbook and think it is ill founded. I know it can be adapted to a given customer’s product and do a great job. This was the case at STK when I worked there. They had mined the field history of their previous products and come up with failure rate adjustment factors. With these, they were able to successfully model the future field reliability profile of newly developed products. This is most helpful in knowing when a product is ready for release and how much service support will be needed once the product gets to the field.
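The component stress approach described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a multiplicative form (a base failure rate scaled by stress, environment, and quality factors), a series system with constant failure rates, and a single field-history adjustment factor of the kind mined from earlier products. Every part name and number below is hypothetical and is not taken from MIL-HDBK-217 or from any company's field data.

```python
from dataclasses import dataclass


@dataclass
class Part:
    """One part type in the design. All numeric values are hypothetical."""
    name: str
    base_rate: float       # base failure rate, failures per million hours
    pi_stress: float       # electrical stress factor (assumed)
    pi_environment: float  # environmental stress factor (assumed)
    pi_quality: float      # quality / application factor (assumed)
    quantity: int = 1

    def failure_rate(self) -> float:
        # Multiplicative adjustment of the base rate, scaled by part count.
        return (self.quantity * self.base_rate * self.pi_stress
                * self.pi_environment * self.pi_quality)


def system_failure_rate(parts, field_adjustment=1.0):
    """Series-system rate: sum of part rates, scaled by a field-history factor."""
    return field_adjustment * sum(p.failure_rate() for p in parts)


def mtbf_hours(rate_per_million_hours):
    """Mean time between failures, assuming a constant failure rate."""
    return 1e6 / rate_per_million_hours


# Hypothetical bill of materials, for illustration only.
EXAMPLE_PARTS = [
    Part("capacitor", base_rate=0.01, pi_stress=2.0,
         pi_environment=4.0, pi_quality=1.0, quantity=10),
    Part("logic IC", base_rate=0.05, pi_stress=1.0,
         pi_environment=4.0, pi_quality=2.0, quantity=5),
]

if __name__ == "__main__":
    handbook_rate = system_failure_rate(EXAMPLE_PARTS)
    adjusted_rate = system_failure_rate(EXAMPLE_PARTS, field_adjustment=0.5)
    print(f"Unadjusted rate: {handbook_rate:.2f} failures per million hours")
    print(f"Field-adjusted rate: {adjusted_rate:.2f} failures per million hours")
    print(f"Field-adjusted MTBF: {mtbf_hours(adjusted_rate):,.0f} hours")
```

The field adjustment factor is where a company's own history enters: a factor below 1.0 says that past products did better in the field than the unadjusted prediction suggested, and a factor above 1.0 says they did worse.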

When I was at NASA Goddard, I worked with Mr. Clifford Ryerson at Hughes, who was the principal engineer in developing the MIL-HDBK-217 handbook. I administered the contract with Clifford to refine and update the 217 failure rate model. I was also a liaison between Hughes and my employer, IBM, on the development of the first laser scanner products. Later I was brought in to put on a reliability training seminar at Hughes. I also did full-time software reliability consulting during 1995-1999 for Hughes’ Wide Area Augmentation System (WAAS). This was an enhanced GPS-based flight navigation system. Additionally, I was asked by Hughes to complete the Communications Systems Reliability analysis of the Hong Kong Airport. So I had five professional engagements with Hughes Aircraft Corporation.

Development of a Software Reliability Prediction Model

This reliability model provides an early prediction of the reliability of the delivered software and a growth profile for how that reliability is projected to grow over time, after release to the field. The model uses the CMMI process rating system to project the quality of the delivered code. This process rating is further tempered by the code extent, operational profile measures, and the developer’s historical field experience. The model sets a baseline for code reliability performance. This estimate can be updated as inspection and test data are obtained. I developed this model over 15 years ago and still receive inquiries from companies using it today. The model is included in the IEEE Recommended Practice on Software Reliability, IEEE Std 1633-2008.

PRISM Model Research

In 1992, I was the principal investigator, along with Mr. Bill Denson of the Reliability Analysis Center (RAC), on developing an improved reliability prediction model called the PRISM model. This model used the latest Rome Air Development Center (RADC) field data, but it now included software failure rates and also took into account design factors reflecting the quality of the development organization. The model gave developers particular areas to focus on to assure more reliable products. Their better development focus was rewarded with predictions of higher reliability for the products. For example, this prediction process incentivized better printed wiring board (PWB) design processes and layouts, along with fault localization capabilities.

Component testing

I was involved in qualifying component parts for space applications while I was at NASA Goddard Space Flight Center, and later for computer systems at IBM in Boulder, Colorado. Two major things happened for me. First, at NASA, I ran extensive testing of solid tantalum capacitors for space flight qualification. This proved fortuitous later at IBM, in a situation I will describe below. Second, when I took over the component qualification lab at IBM Boulder, the lab techs were writing 50 to 80 page reports on their test results. These technicians were not natural writers, and the developers for whom they were testing parts were not normally interested in all the test detail. They only wanted to know whether the part passed or failed. So I came up with a one-page template that contained all the vital information. A caveat was added that the detailed test and qualification information was located in our lab files and was available to anyone interested. As far as I know, no one ever requested the additional lab data. In fairness, the developers were regularly talking to the test folks, so they kept abreast of the test progress. They were not relying on the big test report. So we better served our customer, completed our work more efficiently, and raised the morale of the test technicians in our component test lab.

Failure analysis

I had only been at IBM for a week when my office mate was absent. He had graduated from the same high school as I had, Montgomery Blair in Silver Spring, Maryland. My colleague did not seem productive to me. He wanted to reinvent the wheel. In our conversations, he would talk about re-deriving basic formulas that are used in EE. These formulas were basic tools to be used. I could not imagine how he could ever get anything done.

During my first week, I had been given literature to read about IBM. A man from the Product test area came into our cubicle to drop off a failed Printed Wiring Board (PWB) for testing. I asked him, “What does my colleague do with the failed cards?” He said that my colleague took them to the card test area in manufacturing for analysis. I then got his permission to do that myself. I took the card to the card test area and asked what they did with such cards. I was told that they queued them up until they had a sufficient quantity to run through card test. I asked if I could examine the card. Using just visual examination and continuity measurements, I located the fault. So I went back to the machine test area and showed the discovered fault to the test engineer. He then took me over to a test machine and demonstrated how the failed card’s performance differed from that of a good card. Then he reinserted the good card in the machine, and it had gone bad. It turned out he was “hot plugging” the card into the machine. This is a verboten but common and expedient practice. The high-profile electronic parts were shorting out the adjacent card lands. We then changed the card spec to apply an insulative coating to protect the card from this failure susceptibility.

Then the card test lead showed me a cache of cards labeled “No Defect Found - twice”. They had been tested twice and found good in card test. They had also failed twice in the machine application on the floor. This was a contradiction. Every one of these cards proved fruitful in finding and resolving a problem. This kind of failure contrast is an ideal failure analysis situation.

I found discrepancies between the card test environment and the actual machine application. I discovered these problems on my own. There was no push on me. I ended up defining my own job, and was resolving problems each week. Once I identified the problem, I ALWAYS engaged the knowledge holder (the engineer or the designer) in my analyses. I did not want to reinvent the design. The designer did this best, and the designer was the one to fix the fault. So I would isolate and identify the problem. Once presented with the fault facts, the designer always seemed to quickly grasp the problem and conjure up a solution. I never had to sell the designer a solution. It was like volleyball: I set up the designer, and the designer drove home the solution. These were rewarding and exciting times for me. I tried to make each analysis a joint effort, and shared any recognition we might receive. Then designers started coming to me for help with potential problems. We were working on and resolving problems that were largely below the radar.

Multimillion dollar day

One day while I was in the card test area in manufacturing, I saw a 30-gallon trash can full of solid tantalum capacitors. I asked what was going on. I was told that these capacitors were being replaced on PWBs in machines being returned from the field. They wanted to make these cards equivalent to new so that they could be recycled as new. They had 13 vendors doing the same part replacement. I happened to have worked on qualifying solid tantalum capacitors at NASA. I knew that these capacitors actually get better with age, at least in the short term. This process was wasting good parts and replacing them with less reliable parts. I shut it down in one day. It all resulted from asking questions and linking my past experience to the present opportunity. I received an Outstanding Contribution award for that initiative. Failure analysis was like solving a crime scene problem every day. It was a lot of fun, and rewarding.

Hong Kong airport Reliability Modelling

The reliability modeling analysis of the HK airport communications system was two years overdue. I was sent by Hughes to HK to complete the analysis. When I arrived there, I was able to team with Mr. Conway McGee, an expatriate from Brussels who was a systems engineer. Together we worked on the analysis using a single computer terminal. One of us would key in the program and data. This amounted to real-time unit testing. Together, we completed the delinquent analysis in five weeks. The report and analysis were accepted without any revisions. It was exhilarating to team up on this analysis. You could feel the synergy. It helped that we had complementary skill sets. His understanding of the system architecture, combined with my reliability modeling background, was synergistically effective.

I overheard the phrase “pair programming” at a software reliability conference held some time later. I found out that this is an Extreme Programming (XP) practice, and that it exactly described the process we used in the HK analysis. There were reports of 100 times fewer faults being injected into the code during development. This is a marvelous tool for developing critical software. And it was a most wonderful development experience.

FAA WAAS support – Fullerton CA

I was brought in to Hughes Aircraft to do software reliability analysis of the GPS-based flight navigation system. I worked on temporary duty (TDY) on this program for three years. I received an Appreciation Plaque from the customer, the FAA, for my contributions to the program. I had another personally gratifying experience at Hughes, though I did not find out its impact until 15 years later. I have always been called on to teach engineering assurance courses wherever I worked. I mention this because I met a gentleman this year at a Boeing conference where I gave a technical presentation. He told me that he took a software reliability course from me when we were both at Hughes (since acquired by Raytheon). One element of that course was “rate monotonic analysis”. This led to a satisfying career change for him, getting him into software system requirements. How neat it was for me to learn of this impact. I also share that focus on developing good system requirements. The majority of system problems stem from deficiencies in the requirement statements.

Helium Neon (HeNe) laser scanner development

IBM developed the first HeNe laser scanner in 1974. The laser was invented in 1960, and this was its first high-reliability industrial application. Previously it had been used in laboratories and in construction applications. Our initial samples exhibited poor life characteristics. Some were like a “blue dot flash bulb”. Few lived more than 100 hours. We did have one that went 1,000 hours. We needed a 5,000-hour life for the laser scanner we were developing. This is the same laser you see today in supermarket checkout stands.

When we dissected the initial failed lasers, we found different materials inside. We found one cathode made of 2024 aluminum and one of 6061 aluminum. We found temper T4 and temper T6 cathodes. We worked with the supplier to identify all possible variations in the construction of the laser. We identified the variations in the current build of the laser, plus those variations that the supplier thought might be beneficial. Fortunately, IBM had a visionary manager who invoked Design of Experiments (DOE) to examine the effects of the design variations and, finally, to set the optimized design operating point. We also had a DOE expert, William J. Diamond, author of “Practical Experiment Designs for Engineers and Scientists”. Working with the supplier’s design and manufacturing leaders, along with our materials experts, we identified 13 key variables to test. In school, we were taught to test one variable at a time, keeping all the others fixed. This approach is known as one factor at a time (OFAT). It is neither effective nor efficient. For one thing, you miss the possible interactive effects of the variables. This is a big deal! For example, the Ford Explorer had rollover problems WITH Bridgestone tires WHEN they were underinflated. No single variable, by itself, was significant, but each became significant in combination with the other two.
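A small sketch can make the point about interactions concrete. The Python below is an illustration only, not the experiment we ran: it assumes three hypothetical two-level factors (A, B, C) and an invented response in which a problem appears only when all three are at their high level. It compares what OFAT sees, varying one factor from an all-low baseline, with the effects estimated from a full two-level factorial.

```python
from itertools import product

FACTORS = ("A", "B", "C")  # hypothetical factor names

# Columns of the design matrix used to estimate each effect.
TERMS = {
    "A": (0,), "B": (1,), "C": (2,),
    "AB": (0, 1), "AC": (0, 2), "BC": (1, 2),
    "ABC": (0, 1, 2),
}


def response(a, b, c):
    """Invented response: trouble shows up only when all three factors are high."""
    return 10.0 if (a, b, c) == (1, 1, 1) else 0.0


def ofat_changes(baseline=(-1, -1, -1)):
    """Change in response when one factor at a time is moved to its high level."""
    base = response(*baseline)
    changes = {}
    for i, name in enumerate(FACTORS):
        levels = list(baseline)
        levels[i] = 1
        changes[name] = response(*levels) - base
    return changes


def factorial_effects():
    """Main and interaction effects from a full 2^3 factorial (8 runs)."""
    runs = list(product((-1, 1), repeat=3))
    effects = {}
    for label, columns in TERMS.items():
        contrast = 0.0
        for run in runs:
            sign = 1
            for i in columns:
                sign *= run[i]
            contrast += sign * response(*run)
        # Effect = mean response at the +1 level minus mean at the -1 level.
        effects[label] = contrast / (len(runs) / 2)
    return effects


if __name__ == "__main__":
    print("OFAT changes from the all-low baseline:", ofat_changes())
    print("Full factorial effects:", factorial_effects())
```

With this invented response, OFAT reports no change for any factor, because it never runs the one combination that causes trouble. The factorial runs every combination, so the problem shows up in its estimated effects, including the three-way ABC interaction.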

So the first designed experiment was a screening test to see which of the identified variables were significant. This test involved all 13 variables. Subsequent testing was then carried out on the reduced set of significant variables. The purpose of this additional testing was to find the optimum operating point.

After 15 months of testing, the optimum operating point was determined. The laser performance stabilized and was never a field problem.

Six Sigma

I was invited to join the Six Sigma initiative at Seagate Technology in 1999. This was an exciting opportunity for me. I viewed DOE as the “sweet spot” of Six Sigma. I was put through an extensive Six Sigma training program while at the same time carrying out two product improvement projects. I became a Six Sigma Black Belt and a Six Sigma Master Black Belt within six months of joining Seagate. I executed both process and product improvement projects. I taught both Green Belt and Black Belt courses leading to certification of the students. I also mentored Green Belt and Black Belt projects, and I headed up the Seagate worldwide Six Sigma corporate Master Black Belt Council.

I was also invited by the American Society for Quality (ASQ) to participate with 12 other Master Black Belts in developing a body of knowledge (BOK) exam for Black Belt certification.

Sometimes people ask me if Six Sigma will remain viable. Six Sigma is really a toolbox of best practices, tools, and processes. It will always make sense to apply these best tools. The number of tools is open ended. For instance, I include the XP process tool called “pair programming” in the Six Sigma toolkit. I worked with a colleague from Brussels to develop the Hong Kong airport communications system reliability model. We worked together on the same computer. I found out later that our process was called “pair programming”. So now I include pair programming as one of the tools of Six Sigma. The literature reports a 10 to 100 fold reduction in latent faults in the code from this process. This is a natural “Design for Six Sigma” tool. So I include it, although I would be surprised if any other Six Sigma Belt knew about it or included it. I do think that they all would if they but knew of this tool and its power. One more caveat: tools like pair programming should be applied judiciously, to critical applications. The HK model was one such critical application.

Management, Researcher, Principal Investigator, and Instructor

Throughout my career, I have alternately filled the roles above. It has been easy for me to transition between these roles. Sometimes the roles are concurrent. For instance, I have been an instructor throughout my career. Of course, my instructor role was full-time when I served as an adjunct professor for one year at Prairie View A&M University in Prairie View, Texas. Otherwise I taught at NASA, IBM, Storage Technology, Hughes (Raytheon), and Seagate Technology, all on a part-time basis while serving as an individual contributor or manager.

I was recognized as Manager of the Year for IBM Boulder, Colorado, in 1982. I enjoyed management, especially facilitating project work and supporting my employees. It never bothered me to transition between the roles that I had. In fact, I found great satisfaction in the varied experiences that I had.

- Research and Publications

I have steadily researched and published throughout my career and have over 200 published papers and book chapters in System Engineering handbooks (1) and Reliability Engineering handbooks (2).

- Professional Activities

I have been active in the IEEE Reliability Society, the Technology Management Council, the American Society for Quality, and the IEEE Computer Society. I have held many administrative positions in the IEEE, including President of the Reliability Society, Board of Directors member for the Technology Management Council, and Founding Chair and current Chair of the Denver Chapter of the IEEE Reliability Society. I was also on the Quality Management Council for Colorado State University.

- Satellite training broadcasts and videos for IEEE, National Technological University, and NASA

I have enjoyed a “Dream career”: assuring products, promoting best practices, receiving good fellowship and counsel from the IEEE Reliability Society and its membership, receiving recognition from the IEEE as well as my employers and profession.
